ddl: fail stale backfill task meta#68842
Conversation
|
@zeminzhou I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details. ⏳ This process typically takes 10-30 minutes depending on the complexity of the changes. ℹ️ Learn more details on Pantheon AI. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @zeminzhou. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
💤 Files with no reviewable changes (3)
📝 WalkthroughWalkthroughAdds an unexported sentinel error for outdated backfill task metadata, annotates executor "index info not found" failures with it, makes retryability checks treat that sentinel as non-retryable, and adds tests verifying non-retryability including when wrapped. ChangesStale Backfill Task Metadata Handling
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/ddl/index.go (1)
3107-3115: ⚡ Quick winConsider adding a log message when stale task metadata is detected.
The code correctly validates and fails stale tasks, but a dedicated log message at this point would improve observability when debugging production incidents.
📋 Suggested log addition
if err := validateBackfillTaskMeta(task, reorgInfo); err != nil { + logutil.DDLLogger().Warn("resuming task with stale metadata, marking as failed", + zap.Int64("taskID", task.ID), zap.String("taskKey", task.Key), + zap.Int64("jobID", reorgInfo.Job.ID), zap.Error(err)) if !task.TaskBase.IsDone() {🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/ddl/index.go` around lines 3107 - 3115, When validateBackfillTaskMeta(task, reorgInfo) returns an error and you proceed to fail the task via taskManager.FailTask(w.workCtx, task.ID, task.State, err), add a structured log entry before returning that records the stale metadata error and task identifiers; specifically log the error value, task.ID, task.State and whether task.TaskBase.IsDone() so operators can trace why validateBackfillTaskMeta failed—place the log just after the validateBackfillTaskMeta check and before calling taskManager.FailTask/handle.NotifyTaskChange.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/ddl/index.go`:
- Around line 3107-3115: When validateBackfillTaskMeta(task, reorgInfo) returns
an error and you proceed to fail the task via taskManager.FailTask(w.workCtx,
task.ID, task.State, err), add a structured log entry before returning that
records the stale metadata error and task identifiers; specifically log the
error value, task.ID, task.State and whether task.TaskBase.IsDone() so operators
can trace why validateBackfillTaskMeta failed—place the log just after the
validateBackfillTaskMeta check and before calling
taskManager.FailTask/handle.NotifyTaskChange.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 4ecb5d64-f8e6-4d7f-8071-045bc8b7e986
📒 Files selected for processing (3)
pkg/ddl/backfilling_dist_executor.gopkg/ddl/backfilling_test.gopkg/ddl/index.go
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #68842 +/- ##
================================================
+ Coverage 76.3104% 76.3860% +0.0755%
================================================
Files 2041 2054 +13
Lines 563452 583211 +19759
================================================
+ Hits 429973 445492 +15519
- Misses 132563 134982 +2419
- Partials 916 2737 +1821
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
| if err := json.Unmarshal(task.Meta, taskMeta); err != nil { | ||
| return errors.Trace(err) | ||
| } | ||
| if err := validateBackfillTaskMeta(task, reorgInfo); err != nil { |
There was a problem hiding this comment.
why we need to validate and fail the task here, if DXF subtask failed as we have make the error not retryable, the whole task failed, then DDL part will notices that it's reverted and fail the DDL job
Co-authored-by: D3Hunter <jujj603@gmail.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/ddl/backfilling_dist_executor.go (1)
20-20:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winRemove unused
stringsimport to fix build break.Line 20 imports
strings, but it is no longer referenced. Go will fail compilation with an unused import error.Proposed fix
import ( "context" "encoding/json" - "strings" "github.com/pingcap/errors"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/ddl/backfilling_dist_executor.go` at line 20, Remove the unused "strings" import from the import block in backfilling_dist_executor.go to fix the Go build error; locate the import list in that file (the line currently reading "strings") and delete that entry so only used packages remain imported.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@pkg/ddl/backfilling_dist_executor.go`:
- Line 20: Remove the unused "strings" import from the import block in
backfilling_dist_executor.go to fix the Go build error; locate the import list
in that file (the line currently reading "strings") and delete that entry so
only used packages remain imported.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 08471b1e-fb75-4e57-bbc6-47932041ff9d
📒 Files selected for processing (1)
pkg/ddl/backfilling_dist_executor.go
What problem does this PR solve?
Issue Number: close #68828
Problem Summary:
Distributed add-index backfill can pick up an existing DXF task by task key and resume it without checking whether the persisted
BackfillTaskMetastill matches the current DDL job and reorg elements. If the old task meta contains staleEleIDs, the executor repeatedly returnsindex info not found, and the error is treated as retryable, so the task can retry forever without making progress.What changed and how does it work?
backfill task meta is outdatederror for stale DXF backfill task metadata.reorgInfo.index info not foundduring distributed backfill executor setup into the same non-retryable stale-meta error.Check List
Tests
Test commands:
./tools/check/failpoint-go-test.sh pkg/ddl -run 'TestOutdatedBackfillTaskMetaIsNonRetryable|TestValidateBackfillTaskMeta' -count=1 make bazel_prepare make lintSide effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
Summary by CodeRabbit
Bug Fixes
Tests